The IODS course teaches how to use R, RStudio, R Markdown, and GitHub to apply statistical methods of data science. The link to my GitHub repository: https://github.com/yingjchen/IODS-project.
# This is a so-called "R chunk" where you can write R code.
date()
## [1] "Fri Dec 8 22:10:59 2023"
The text continues here.
Describe the work you have done this week and summarize your learning.
date()
## [1] "Fri Dec 8 22:10:59 2023"
Here we go again…
df <- read.csv("./data/learning2014.txt", sep = ",", stringsAsFactors = FALSE, header = TRUE)
dim(df)
## [1] 166 7
head(df)
## gender age attitude deep stra surf points
## 1 F 53 3.7 3.583333 3.375 2.583333 25
## 2 M 55 3.1 2.916667 2.750 3.166667 12
## 3 F 49 2.5 3.500000 3.625 2.250000 24
## 4 M 53 3.5 3.500000 3.125 2.250000 10
## 5 M 49 3.7 3.666667 3.625 2.833333 22
## 6 F 38 3.8 4.750000 3.625 2.416667 21
str(df)
## 'data.frame': 166 obs. of 7 variables:
## $ gender : chr "F" "M" "F" "M" ...
## $ age : int 53 55 49 53 49 38 50 37 37 42 ...
## $ attitude: num 3.7 3.1 2.5 3.5 3.7 3.8 3.5 2.9 3.8 2.1 ...
## $ deep : num 3.58 2.92 3.5 3.5 3.67 ...
## $ stra : num 3.38 2.75 3.62 3.12 3.62 ...
## $ surf : num 2.58 3.17 2.25 2.25 2.83 ...
## $ points : int 25 12 24 10 22 21 21 31 24 26 ...
The students2014 data contains 166 observations and 7 variables (gender, age, attitude, deep, stra, surf, points).
par(mfrow = c(2,4))
barplot(table(df$gender), main = 'gender')
hist(df$age)
hist(df$attitude)
hist(df$deep)
hist(df$stra)
hist(df$surf)
hist(df$points)
pairs(df[, -1])
lm.1 <- lm(points~attitude+deep+stra, data = df)
summary(lm.1)
##
## Call:
## lm(formula = points ~ attitude + deep + stra, data = df)
##
## Residuals:
## Min 1Q Median 3Q Max
## -17.5239 -3.4276 0.5474 3.8220 11.5112
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 11.3915 3.4077 3.343 0.00103 **
## attitude 3.5254 0.5683 6.203 4.44e-09 ***
## deep -0.7492 0.7507 -0.998 0.31974
## stra 0.9621 0.5367 1.793 0.07489 .
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 5.289 on 162 degrees of freedom
## Multiple R-squared: 0.2097, Adjusted R-squared: 0.195
## F-statistic: 14.33 on 3 and 162 DF, p-value: 2.521e-08
Next, remove the two variables (deep and stra) that did not have a statistically significant relationship with the target variable (exam points), and fit the regression model again with the remaining explanatory variable (attitude).
lm.2 <- lm(points~attitude, data = df)
summary(lm.2)
##
## Call:
## lm(formula = points ~ attitude, data = df)
##
## Residuals:
## Min 1Q Median 3Q Max
## -16.9763 -3.2119 0.4339 4.1534 10.6645
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 11.6372 1.8303 6.358 1.95e-09 ***
## attitude 3.5255 0.5674 6.214 4.12e-09 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 5.32 on 164 degrees of freedom
## Multiple R-squared: 0.1906, Adjusted R-squared: 0.1856
## F-statistic: 38.61 on 1 and 164 DF, p-value: 4.119e-09
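The manual pruning above can also be automated with backward elimination by AIC. A minimal sketch on the built-in mtcars data (the variables here are just an illustration, not part of the course data):

```r
# Backward elimination by AIC: start from a full model and let step() drop terms
full <- lm(mpg ~ wt + hp + drat + qsec, data = mtcars)
reduced <- step(full, direction = "backward", trace = 0)
# the reduced model never has more terms than the full one
formula(reduced)
```

Note that AIC-based selection can retain variables with p-values above 0.05, so it will not always agree with manual p-value pruning.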
R-squared tells how large a proportion of the variance in Y the fitted model explains, relative to the total variance in Y.
summary(lm.2)$r.squared
## [1] 0.1905537
In this case, the linear model on attitude explains around 19% of the variation in exam points, and therefore leaves about 81% of the variation in exam points as something that cannot be explained by (a linear effect of) attitude alone.
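This definition can be checked by hand: R-squared is one minus the residual sum of squares divided by the total sum of squares. A small sketch on toy data:

```r
x <- c(1, 2, 3, 4, 5)
y <- c(2.1, 3.9, 6.2, 7.8, 10.1)
fit <- lm(y ~ x)
rss <- sum(resid(fit)^2)      # unexplained variation
tss <- sum((y - mean(y))^2)   # total variation in y
r2  <- 1 - rss / tss
all.equal(r2, summary(fit)$r.squared)  # TRUE
```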
par(mfrow = c(2,2))
plot(lm.2)
The Residuals vs Fitted plot should not show a pattern where the spread of the residuals depends on the fitted values. Thus, it is a good sign if the red line marking the mean of the residuals stays close to a horizontal line.
The points in the Q-Q plot should ideally fall on the diagonal line, in which case the residuals are normally distributed and the standard errors and p-values of the model coefficients are reliable. In this case, the Q-Q plot deviates from the diagonal line at the tails.
The Residuals vs Leverage plot should not show points outside the dashed curves marking a Cook's distance of 1. Such points would have a high influence on the fitted coefficients, and the lm() estimates from such data would then not be reliable.
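The influence measures behind the last diagnostic plot can also be inspected numerically; a sketch with cooks.distance() on the built-in mtcars data:

```r
fit <- lm(mpg ~ wt, data = mtcars)
cd <- cooks.distance(fit)
# the three most influential observations
head(sort(cd, decreasing = TRUE), 3)
# a common rule of thumb flags points with distance near or above 1
any(cd > 1)
```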
date()
## [1] "Fri Dec 8 22:10:59 2023"
alc <- read.table("https://raw.githubusercontent.com/KimmoVehkalahti/Helsinki-Open-Data-Science/master/datasets/alc.csv", sep=",", header=TRUE)
dim(alc)
## [1] 370 35
colnames(alc)
## [1] "school" "sex" "age" "address" "famsize"
## [6] "Pstatus" "Medu" "Fedu" "Mjob" "Fjob"
## [11] "reason" "guardian" "traveltime" "studytime" "schoolsup"
## [16] "famsup" "activities" "nursery" "higher" "internet"
## [21] "romantic" "famrel" "freetime" "goout" "Dalc"
## [26] "Walc" "health" "failures" "paid" "absences"
## [31] "G1" "G2" "G3" "alc_use" "high_use"
The alc data describes student achievement in secondary education at two Portuguese schools; it contains 370 students and 35 variables.
Choose 4 interesting variables in the data and for each of them, present your personal hypothesis about their relationships with alcohol consumption:
I am interested in exploring the relationships between high/low alcohol consumption and 4 variables (‘age’, ‘Pstatus’, ‘studytime’, ‘goout’).
I hypothesize that age and going out are positively associated with the likelihood of high alcohol consumption, while study time is expected to be negatively associated with it. I also expect students whose parents live apart to be more likely to consume alcohol heavily.
Note:
age - student’s age (numeric: from 15 to 22)
Pstatus - parent’s cohabitation status (binary: ‘T’ - living together or ‘A’ - apart)
studytime - weekly study time (numeric: 1 - <2 hours, 2 - 2 to 5 hours, 3 - 5 to 10 hours, or 4 - >10 hours)
goout - going out with friends (numeric: from 1 - very low to 5 - very high)
Numerically and graphically explore the distributions of your chosen variables and their relationships with alcohol consumption (use for example cross-tabulations, bar plots and box plots).
par(mfrow = c(2,2))
hist(alc$age, main = 'age')
barplot(table(alc$Pstatus), main = 'Pstatus')
barplot(table(alc$studytime), main = 'studytime')
barplot(table(alc$goout), main = 'goout')
Most students are 15-18 years old. Most parents live together.
library(ggplot2)
#alc$studytime = as.factor(alc$studytime)
#alc$goout = as.factor(alc$goout)
#alc$age = as.factor(alc$age)
alc$high_use = as.factor(alc$high_use)
g1 <- ggplot(alc, aes(x = Pstatus, y = alc_use))
g1 + geom_boxplot() +
geom_jitter(aes(color=high_use), size=0.4, alpha=0.9) +
ylab("alcohol consumption") + theme_classic()
g2 <- ggplot(alc, aes(x = high_use, y = studytime))
g2 + geom_boxplot() +geom_jitter(color="black", size=0.4, alpha=0.9) + ylab("studytime") +theme_classic()
g3 <- ggplot(alc, aes(x = high_use, y = goout))
g3 + geom_boxplot() +geom_jitter(color="black", size=0.4, alpha=0.9) + ylab("goout") + theme_classic()
g4 <- ggplot(alc, aes(x = high_use, y = age))
g4 + geom_boxplot() +geom_jitter(color="black", size=0.4, alpha=0.9) + ylab("age") +theme_classic()
#install.packages('crosstable')
#library(crosstable)
#crosstable(alc, c(age, studytime), by=high_use)
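As a base-R alternative to the commented-out crosstable() call above, table() and prop.table() give quick cross-tabulations; a sketch on toy vectors (not the course data):

```r
study <- c(1, 2, 2, 3, 4, 1, 2, 3)
high  <- c(TRUE, TRUE, FALSE, FALSE, FALSE, TRUE, FALSE, FALSE)
tab <- table(studytime = study, high_use = high)
tab
prop.table(tab, margin = 1)  # row-wise proportions
```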
glm.1 <- glm(high_use ~ age + studytime + Pstatus + goout, data = alc, family = "binomial")
summary(glm.1)
##
## Call:
## glm(formula = high_use ~ age + studytime + Pstatus + goout, family = "binomial",
## data = alc)
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -3.8778 1.8521 -2.094 0.036288 *
## age 0.1185 0.1094 1.083 0.279024
## studytime -0.6276 0.1671 -3.757 0.000172 ***
## PstatusT -0.1064 0.4070 -0.261 0.793752
## goout 0.7256 0.1183 6.132 8.7e-10 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 452.04 on 369 degrees of freedom
## Residual deviance: 386.18 on 365 degrees of freedom
## AIC: 396.18
##
## Number of Fisher Scoring iterations: 4
#remove variables with p-values larger than 0.05, and fit the LR model again
glm.2 <- glm(high_use ~ studytime + goout, data = alc, family = "binomial")
summary(glm.2)
##
## Call:
## glm(formula = high_use ~ studytime + goout, family = "binomial",
## data = alc)
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -2.0686 0.5116 -4.044 5.26e-05 ***
## studytime -0.6224 0.1660 -3.750 0.000177 ***
## goout 0.7427 0.1175 6.322 2.57e-10 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 452.04 on 369 degrees of freedom
## Residual deviance: 387.41 on 367 degrees of freedom
## AIC: 393.41
##
## Number of Fisher Scoring iterations: 4
pred.2 <- predict(glm.2, newdata = alc[, c("studytime", "goout")], type = "response")
pred.2.label <- pred.2 >= 0.5
table(pred.2.label)
## pred.2.label
## FALSE TRUE
## 308 62
table( truelabels = alc$high_use, predictions = pred.2.label)
## predictions
## truelabels FALSE TRUE
## FALSE 238 21
## TRUE 70 41
mosaicplot(table( truelabels = alc$high_use, predictions = pred.2.label))
Compute the total proportion of inaccurately classified individuals (= the training error) and comment on all the results.
inaccuracy = (21+70)/nrow(alc)
inaccuracy
## [1] 0.2459459
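The training-error computation can be wrapped in a small reusable function (a sketch; loss_func is my own hypothetical helper, not from any package):

```r
# proportion of observations where the thresholded probability disagrees with the true class
loss_func <- function(class, prob, threshold = 0.5) {
  mean((prob > threshold) != class)
}
# toy check: 0.9 is wrong for FALSE, 0.2 is wrong for TRUE
loss_func(class = c(FALSE, TRUE, TRUE), prob = c(0.9, 0.8, 0.2))  # 2/3
```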
Compared with the performance achieved by a simple guessing strategy (expected inaccuracy 0.5 for a random guess), the logistic regression model performs clearly better, with a training error of about 0.25.
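Logistic-regression coefficients are easier to interpret as odds ratios (the exponentials of the coefficients). A sketch on the built-in mtcars data, with Wald confidence intervals from confint.default():

```r
fit <- glm(am ~ wt, data = mtcars, family = binomial)
OR <- exp(coef(fit))             # odds ratios
CI <- exp(confint.default(fit))  # Wald 95% CIs on the odds-ratio scale
cbind(OR, CI)
```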
date()
## [1] "Fri Dec 8 22:11:01 2023"
library(MASS)
data("Boston")
str(Boston)
## 'data.frame': 506 obs. of 14 variables:
## $ crim : num 0.00632 0.02731 0.02729 0.03237 0.06905 ...
## $ zn : num 18 0 0 0 0 0 12.5 12.5 12.5 12.5 ...
## $ indus : num 2.31 7.07 7.07 2.18 2.18 2.18 7.87 7.87 7.87 7.87 ...
## $ chas : int 0 0 0 0 0 0 0 0 0 0 ...
## $ nox : num 0.538 0.469 0.469 0.458 0.458 0.458 0.524 0.524 0.524 0.524 ...
## $ rm : num 6.58 6.42 7.18 7 7.15 ...
## $ age : num 65.2 78.9 61.1 45.8 54.2 58.7 66.6 96.1 100 85.9 ...
## $ dis : num 4.09 4.97 4.97 6.06 6.06 ...
## $ rad : int 1 2 2 3 3 3 5 5 5 5 ...
## $ tax : num 296 242 242 222 222 222 311 311 311 311 ...
## $ ptratio: num 15.3 17.8 17.8 18.7 18.7 18.7 15.2 15.2 15.2 15.2 ...
## $ black : num 397 397 393 395 397 ...
## $ lstat : num 4.98 9.14 4.03 2.94 5.33 ...
## $ medv : num 24 21.6 34.7 33.4 36.2 28.7 22.9 27.1 16.5 18.9 ...
head(Boston)
## crim zn indus chas nox rm age dis rad tax ptratio black lstat
## 1 0.00632 18 2.31 0 0.538 6.575 65.2 4.0900 1 296 15.3 396.90 4.98
## 2 0.02731 0 7.07 0 0.469 6.421 78.9 4.9671 2 242 17.8 396.90 9.14
## 3 0.02729 0 7.07 0 0.469 7.185 61.1 4.9671 2 242 17.8 392.83 4.03
## 4 0.03237 0 2.18 0 0.458 6.998 45.8 6.0622 3 222 18.7 394.63 2.94
## 5 0.06905 0 2.18 0 0.458 7.147 54.2 6.0622 3 222 18.7 396.90 5.33
## 6 0.02985 0 2.18 0 0.458 6.430 58.7 6.0622 3 222 18.7 394.12 5.21
## medv
## 1 24.0
## 2 21.6
## 3 34.7
## 4 33.4
## 5 36.2
## 6 28.7
dim(Boston)
## [1] 506 14
The Boston data contains 506 observations and 14 numeric variables.
pairs(Boston)
library(tidyr)
library(corrplot)
## corrplot 0.92 loaded
# for continuous variables
# calculate the correlation matrix and round it
cor_matrix <- cor(Boston)
# visualize the correlation matrix
corrplot(cor_matrix, method="circle")
# center and standardize variables
boston_scaled <- as.data.frame(scale(Boston))
# class of the boston_scaled objects
class(boston_scaled)
## [1] "data.frame"
str(boston_scaled)
## 'data.frame': 506 obs. of 14 variables:
## $ crim : num -0.419 -0.417 -0.417 -0.416 -0.412 ...
## $ zn : num 0.285 -0.487 -0.487 -0.487 -0.487 ...
## $ indus : num -1.287 -0.593 -0.593 -1.306 -1.306 ...
## $ chas : num -0.272 -0.272 -0.272 -0.272 -0.272 ...
## $ nox : num -0.144 -0.74 -0.74 -0.834 -0.834 ...
## $ rm : num 0.413 0.194 1.281 1.015 1.227 ...
## $ age : num -0.12 0.367 -0.266 -0.809 -0.511 ...
## $ dis : num 0.14 0.557 0.557 1.077 1.077 ...
## $ rad : num -0.982 -0.867 -0.867 -0.752 -0.752 ...
## $ tax : num -0.666 -0.986 -0.986 -1.105 -1.105 ...
## $ ptratio: num -1.458 -0.303 -0.303 0.113 0.113 ...
## $ black : num 0.441 0.441 0.396 0.416 0.441 ...
## $ lstat : num -1.074 -0.492 -1.208 -1.36 -1.025 ...
## $ medv : num 0.16 -0.101 1.323 1.182 1.486 ...
The scaled Boston data contains 506 observations and 14 variables.
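scale() subtracts each column mean and divides by the column standard deviation, so every scaled column has mean 0 and sd 1. A quick sketch:

```r
m <- scale(matrix(c(1, 2, 3, 10, 20, 30), ncol = 2))
round(colMeans(m), 10)      # 0 0
round(apply(m, 2, sd), 10)  # 1 1
```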
#create a factor variable
boston_scaled <- as.data.frame(boston_scaled)
#boston_scaled$crim <- as.numeric(boston_scaled$crim)
#create a quantile vector
bins <- quantile(boston_scaled$crim)
bins
## 0% 25% 50% 75% 100%
## -0.419366929 -0.410563278 -0.390280295 0.007389247 9.924109610
# create a categorical variable 'crime'
crime <- cut(boston_scaled$crim, breaks = bins, labels = c("low", "med_low", "med_high", "high"), include.lowest = TRUE)
# remove original crim from the dataset
boston_scaled <- dplyr::select(boston_scaled, -crim)
# add the new categorical value to scaled data
boston_scaled <- data.frame(boston_scaled, crime)
boston_scaled$crime <- factor(boston_scaled$crime, levels = c("low", "med_low", "med_high", "high"))
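The quantile-based binning above can be checked on toy data: with quartile breaks and include.lowest = TRUE, cut() puts a quarter of the values into each class.

```r
x <- c(1, 3, 5, 7, 9, 11, 13, 15)
q <- quantile(x)
cats <- cut(x, breaks = q,
            labels = c("low", "med_low", "med_high", "high"),
            include.lowest = TRUE)
table(cats)  # two observations per class
```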
# Divide the dataset into train and test sets
n <- nrow(boston_scaled)
# choose randomly 80% of the rows
ind <- sample(n, size = n * 0.8)
# create train set
train <- boston_scaled[ind,]
# create test set
test <- boston_scaled[-ind,]
# linear discriminant analysis
lda.fit <- lda(crime ~ ., data = train)
# print the lda.fit object
lda.fit
## Call:
## lda(crime ~ ., data = train)
##
## Prior probabilities of groups:
## low med_low med_high high
## 0.2376238 0.2549505 0.2450495 0.2623762
##
## Group means:
## zn indus chas nox rm age
## low 0.7854515 -0.8977016 -0.14929469 -0.8523828 0.39401939 -0.8503861
## med_low -0.1408928 -0.3141000 -0.04298342 -0.5483428 -0.14697258 -0.2790003
## med_high -0.3642391 0.1927470 0.16512651 0.4299369 0.05003221 0.3866833
## high -0.4872402 1.0170298 -0.08661679 1.0445566 -0.48071276 0.7974988
## dis rad tax ptratio black lstat
## low 0.8358908 -0.6887728 -0.7246034 -0.3927698 0.38259990 -0.75253470
## med_low 0.2850585 -0.5514748 -0.4733725 -0.1032334 0.32079610 -0.10431333
## med_high -0.3918346 -0.3844363 -0.2877691 -0.3261231 0.06862002 0.08673234
## high -0.8382711 1.6390172 1.5146914 0.7818116 -0.75762774 0.87829194
## medv
## low 0.449021087
## med_low -0.001666894
## med_high 0.132070749
## high -0.666101252
##
## Coefficients of linear discriminants:
## LD1 LD2 LD3
## zn 0.107122255 0.711496163 -1.06303332
## indus 0.005706519 -0.351111904 0.12931212
## chas -0.096786117 -0.120677663 0.05449836
## nox 0.382443245 -0.706634934 -1.38997070
## rm -0.096251501 -0.098382339 -0.19130057
## age 0.254386861 -0.249550217 0.04805445
## dis -0.121899310 -0.238730774 0.13415876
## rad 3.089745900 0.864823049 -0.19110509
## tax -0.064689723 0.033550886 0.75640036
## ptratio 0.106045707 0.099821528 -0.35112093
## black -0.122405329 -0.003069977 0.10430904
## lstat 0.228842668 -0.285440158 0.26813602
## medv 0.216554203 -0.403182346 -0.22117231
##
## Proportion of trace:
## LD1 LD2 LD3
## 0.9504 0.0369 0.0127
# the function for lda biplot arrows
lda.arrows <- function(x, myscale = 1, arrow_heads = 0.1, color = "red", tex = 0.75, choices = c(1,2)){
heads <- coef(x)
graphics::arrows(x0 = 0, y0 = 0,
x1 = myscale * heads[,choices[1]],
y1 = myscale * heads[,choices[2]], col=color, length = arrow_heads)
text(myscale * heads[,choices], labels = row.names(heads),
cex = tex, col=color, pos=3)
}
# target classes as numeric
classes <- as.numeric(train$crime)
# plot the lda results (select both lines and execute them at the same time!)
plot(lda.fit, dimen = 2)
lda.arrows(lda.fit, myscale = 1)
# save the correct classes from test data
correct_classes <- test$crime
# remove the crime variable from test data
test <- dplyr::select(test, -crime)
# predict classes with test data
lda.pred <- predict(lda.fit, newdata = test)
# cross tabulate the results
table(correct = correct_classes, predicted = lda.pred$class)
## predicted
## correct low med_low med_high high
## low 22 9 0 0
## med_low 8 11 4 0
## med_high 0 12 15 0
## high 0 0 0 21
On this test set the LDA model predicts the ‘high’ class perfectly and performs reasonably on ‘low’, while the middle classes (‘med_low’ and ‘med_high’) are frequently confused with their neighboring classes.
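Overall accuracy is the sum of the diagonal of the confusion matrix divided by the number of test observations. Using the table printed above (note: the train/test split is random, so these exact counts will vary between runs):

```r
cm <- matrix(c(22,  9,  0,  0,
                8, 11,  4,  0,
                0, 12, 15,  0,
                0,  0,  0, 21),
             nrow = 4, byrow = TRUE,
             dimnames = list(correct   = c("low", "med_low", "med_high", "high"),
                             predicted = c("low", "med_low", "med_high", "high")))
accuracy <- sum(diag(cm)) / sum(cm)
accuracy  # about 0.68 on this 102-observation test set
```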
library(MASS)
data("Boston")
boston_scaled.2 <- as.data.frame(scale(Boston))
class(Boston)
## [1] "data.frame"
dist_eu <- dist(boston_scaled.2)
summary(dist_eu)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.1343 3.4625 4.8241 4.9111 6.1863 14.3970
dist_man<- dist(boston_scaled.2, method = 'manhattan')
summary(dist_man)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.2662 8.4832 12.6090 13.5488 17.7568 48.8618
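The two distance metrics can be sanity-checked on a pair of toy points: for (0,0) and (3,4) the Euclidean distance is 5 while the Manhattan distance is 7.

```r
pts <- matrix(c(0, 0,
                3, 4), nrow = 2, byrow = TRUE)
as.numeric(dist(pts))                        # 5
as.numeric(dist(pts, method = "manhattan"))  # 7
```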
library(ggplot2)
set.seed(123)
# determine the number of clusters
k_max <- 10
# calculate the total within sum of squares
twcss <- sapply(1:k_max, function(k){kmeans(Boston, k)$tot.withinss})
# visualize the results
plot(x = 1:k_max, y = twcss, type = "l" )
# k-means clustering
# choose the number of clusters as 3
km <- kmeans(Boston, centers = 3)
# plot the Boston dataset with clusters
pairs(Boston, col = km$cluster)
date()
## [1] "Fri Dec 8 22:11:03 2023"
library(readr)
human <- read_csv("https://raw.githubusercontent.com/KimmoVehkalahti/Helsinki-Open-Data-Science/master/datasets/human2.csv")
## Rows: 155 Columns: 9
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## chr (1): Country
## dbl (8): Edu2.FM, Labo.FM, Life.Exp, Edu.Exp, GNI, Mat.Mor, Ado.Birth, Parli.F
##
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
dim(human)
## [1] 155 9
library(tibble)
human_ <- column_to_rownames(human, "Country")
summary(human_)
## Edu2.FM Labo.FM Life.Exp Edu.Exp
## Min. :0.1717 Min. :0.1857 Min. :49.00 Min. : 5.40
## 1st Qu.:0.7264 1st Qu.:0.5984 1st Qu.:66.30 1st Qu.:11.25
## Median :0.9375 Median :0.7535 Median :74.20 Median :13.50
## Mean :0.8529 Mean :0.7074 Mean :71.65 Mean :13.18
## 3rd Qu.:0.9968 3rd Qu.:0.8535 3rd Qu.:77.25 3rd Qu.:15.20
## Max. :1.4967 Max. :1.0380 Max. :83.50 Max. :20.20
## GNI Mat.Mor Ado.Birth Parli.F
## Min. : 581 Min. : 1.0 Min. : 0.60 Min. : 0.00
## 1st Qu.: 4198 1st Qu.: 11.5 1st Qu.: 12.65 1st Qu.:12.40
## Median : 12040 Median : 49.0 Median : 33.60 Median :19.30
## Mean : 17628 Mean : 149.1 Mean : 47.16 Mean :20.91
## 3rd Qu.: 24512 3rd Qu.: 190.0 3rd Qu.: 71.95 3rd Qu.:27.95
## Max. :123124 Max. :1100.0 Max. :204.80 Max. :57.50
#graphical overview of the data
library(ggplot2)
library(GGally)
## Registered S3 method overwritten by 'GGally':
## method from
## +.gg ggplot2
ggpairs(human_, progress = FALSE)
There are strong positive correlations between Ado.Birth and Mat.Mor, and between Edu.Exp and Edu2.FM.
There are strong negative correlations between Life.Exp and Mat.Mor, Edu.Exp and Mat.Mor, Edu2.FM and Mat.Mor, Ado.Birth and Life.Exp, and Ado.Birth and Edu.Exp.
pca_human <- prcomp(human_)
biplot(pca_human, choices = 1:2, xlabs=rep(".", nrow(human_)))
## Warning in arrows(0, 0, y[, 1L] * 0.8, y[, 2L] * 0.8, col = col[2L], length =
## arrow.len): zero-length arrow is of indeterminate angle and so skipped
human_scale <- scale(human_)
pca_human_scale <- prcomp(human_scale)
biplot(pca_human_scale, choices = 1:2, xlabs=rep(".", nrow(human_scale)))
s <- summary(pca_human)
pca_pr <- round(1*s$importance[2, ], digits = 5)
pca_pr
## PC1 PC2 PC3 PC4 PC5 PC6 PC7 PC8
## 0.9999 0.0001 0.0000 0.0000 0.0000 0.0000 0.0000 0.0000
s_scale <- summary(pca_human_scale)
pca_pr_scale <- round(1*s_scale$importance[2, ], digits = 5)
pca_pr_scale
## PC1 PC2 PC3 PC4 PC5 PC6 PC7 PC8
## 0.53605 0.16237 0.09571 0.07583 0.05477 0.03595 0.02634 0.01298
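The contrast between the two variance tables illustrates why scaling matters for PCA: without it, the variable with the largest raw variance (here GNI) dominates PC1. A toy sketch of the same effect:

```r
set.seed(1)
d <- data.frame(small = rnorm(100), big = rnorm(100) * 1000)
p_raw <- prcomp(d)
p_std <- prcomp(d, scale. = TRUE)
summary(p_raw)$importance[2, 1]  # PC1 share close to 1: 'big' dominates
summary(p_std)$importance[2, 1]  # PC1 share near 0.5 after scaling
```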
tea <- read.csv("https://raw.githubusercontent.com/KimmoVehkalahti/Helsinki-Open-Data-Science/master/datasets/tea.csv", stringsAsFactors = TRUE)
#View(tea)
head(tea)
## breakfast tea.time evening lunch dinner always home
## 1 breakfast Not.tea time Not.evening Not.lunch Not.dinner Not.always home
## 2 breakfast Not.tea time Not.evening Not.lunch Not.dinner Not.always home
## 3 Not.breakfast tea time evening Not.lunch dinner Not.always home
## 4 Not.breakfast Not.tea time Not.evening Not.lunch dinner Not.always home
## 5 breakfast Not.tea time evening Not.lunch Not.dinner always home
## 6 Not.breakfast Not.tea time Not.evening Not.lunch dinner Not.always home
## work tearoom friends resto pub Tea How sugar
## 1 Not.work Not.tearoom Not.friends Not.resto Not.pub black alone sugar
## 2 Not.work Not.tearoom Not.friends Not.resto Not.pub black milk No.sugar
## 3 work Not.tearoom friends resto Not.pub Earl Grey alone No.sugar
## 4 Not.work Not.tearoom Not.friends Not.resto Not.pub Earl Grey alone sugar
## 5 Not.work Not.tearoom Not.friends Not.resto Not.pub Earl Grey alone No.sugar
## 6 Not.work Not.tearoom Not.friends Not.resto Not.pub Earl Grey alone No.sugar
## how where price age sex SPC Sport age_Q
## 1 tea bag chain store p_unknown 39 M middle sportsman 35-44
## 2 tea bag chain store p_variable 45 F middle sportsman 45-59
## 3 tea bag chain store p_variable 47 F other worker sportsman 45-59
## 4 tea bag chain store p_variable 23 M student Not.sportsman 15-24
## 5 tea bag chain store p_variable 48 M employee sportsman 45-59
## 6 tea bag chain store p_private label 21 M student sportsman 15-24
## frequency escape.exoticism spirituality healthy diuretic
## 1 1/day Not.escape-exoticism Not.spirituality healthy Not.diuretic
## 2 1/day escape-exoticism Not.spirituality healthy diuretic
## 3 +2/day Not.escape-exoticism Not.spirituality healthy diuretic
## 4 1/day escape-exoticism spirituality healthy Not.diuretic
## 5 +2/day escape-exoticism spirituality Not.healthy diuretic
## 6 1/day Not.escape-exoticism Not.spirituality healthy Not.diuretic
## friendliness iron.absorption feminine sophisticated
## 1 Not.friendliness Not.iron absorption Not.feminine Not.sophisticated
## 2 Not.friendliness Not.iron absorption Not.feminine Not.sophisticated
## 3 friendliness Not.iron absorption Not.feminine Not.sophisticated
## 4 Not.friendliness Not.iron absorption Not.feminine sophisticated
## 5 friendliness Not.iron absorption Not.feminine Not.sophisticated
## 6 Not.friendliness Not.iron absorption Not.feminine Not.sophisticated
## slimming exciting relaxing effect.on.health
## 1 No.slimming No.exciting No.relaxing No.effect on health
## 2 No.slimming exciting No.relaxing No.effect on health
## 3 No.slimming No.exciting relaxing No.effect on health
## 4 No.slimming No.exciting relaxing No.effect on health
## 5 No.slimming No.exciting relaxing No.effect on health
## 6 No.slimming No.exciting relaxing No.effect on health
#dim(tea)
library(tidyr)
library(dplyr)
##
## Attaching package: 'dplyr'
## The following object is masked from 'package:MASS':
##
## select
## The following objects are masked from 'package:stats':
##
## filter, lag
## The following objects are masked from 'package:base':
##
## intersect, setdiff, setequal, union
keep_columns <- c("Tea", "How", "how", "sugar", "where", "lunch")
tea_time <- select(tea, all_of(keep_columns))
pivot_longer(tea_time, cols = everything()) %>%
ggplot(aes(value)) + facet_wrap("name", scales = "free")
#MCA
#install.packages('FactoMineR')
library(FactoMineR)
mca <- MCA(tea_time, graph = FALSE)
plot(mca, invisible=c("ind"), graph.type = "classic")
date()
## [1] "Fri Dec 8 22:11:09 2023"
#BPRS (long form)
library(ggplot2)
library(dplyr)
library(tidyr)
BPRS <- read.table('https://raw.githubusercontent.com/KimmoVehkalahti/MABS/master/Examples/data/BPRS.txt', header = T)
BPRSL <- read.csv('./data/BPRSL.txt', header = T, sep = '\t', stringsAsFactors = F)
# Draw the plot
BPRSL$treatment <- factor(BPRSL$treatment)
BPRSL$subject <- factor(BPRSL$subject)
ggplot(BPRSL, aes(x = week, y = bprs, linetype = subject)) +
geom_line() +
scale_linetype_manual(values = rep(1:10, times=4)) +
facet_grid(. ~ treatment, labeller = label_both) +
theme(legend.position = "none") +
scale_y_continuous(limits = c(min(BPRSL$bprs), max(BPRSL$bprs)))
BPRSL <- BPRSL %>%
group_by(week) %>%
mutate(stdbprs = scale(bprs)) %>%
ungroup()
# Plot again with the standardised bprs
ggplot(BPRSL, aes(x = week, y = stdbprs, linetype = subject)) +
geom_line() +
scale_linetype_manual(values = rep(1:10, times=4)) +
facet_grid(. ~ treatment, labeller = label_both) +
scale_y_continuous(name = "standardized bprs")
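The group-wise standardization in the pipeline above can be reproduced in base R with ave(); a toy check:

```r
x <- c(1, 2, 3, 10, 20, 30)
g <- c("a", "a", "a", "b", "b", "b")
z <- ave(x, g, FUN = function(v) (v - mean(v)) / sd(v))
z  # -1 0 1 -1 0 1: each group now has mean 0 and sd 1
```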
BPRSL8S <- BPRSL %>%
filter(week > 0) %>%
group_by(treatment, subject) %>%
summarise( mean=mean(bprs) ) %>%
ungroup()
## `summarise()` has grouped output by 'treatment'. You can override using the
## `.groups` argument.
# Draw a boxplot of the mean versus treatment
library(ggplot2)
ggplot(BPRSL8S, aes(x = treatment, y = mean)) +
geom_boxplot() +
stat_summary(fun = "mean", geom = "point", shape=23, size=4, fill = "white") +
scale_y_continuous(name = "mean(bprs), weeks 1-8")
BPRSL8S1 <- BPRSL8S %>%
filter(mean < 60)
t.test(mean ~ treatment, data = BPRSL8S1, var.equal = TRUE)
##
## Two Sample t-test
##
## data: mean by treatment
## t = 0.52095, df = 37, p-value = 0.6055
## alternative hypothesis: true difference in means between group 1 and group 2 is not equal to 0
## 95 percent confidence interval:
## -4.232480 7.162085
## sample estimates:
## mean in group 1 mean in group 2
## 36.16875 34.70395
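The pooled two-sample t statistic reported above can be reproduced by hand from the pooled variance; a sketch on toy vectors:

```r
a <- c(34, 36, 38, 33)
b <- c(30, 31, 29, 35)
tt <- t.test(a, b, var.equal = TRUE)
# pooled variance and the pooled-t formula
sp2 <- ((length(a) - 1) * var(a) + (length(b) - 1) * var(b)) /
       (length(a) + length(b) - 2)
tstat <- (mean(a) - mean(b)) / sqrt(sp2 * (1 / length(a) + 1 / length(b)))
all.equal(unname(tt$statistic), tstat)  # TRUE
```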
library(dplyr)
library(tidyr)
# Add the baseline from the original data as a new variable to the summary data
BPRSL8S2 <- BPRSL8S %>%
mutate(baseline = BPRS$week0)
# Fit the linear model with the mean as the response
fit <- lm(mean~baseline + treatment, data = BPRSL8S2)
summary(fit)
##
## Call:
## lm(formula = mean ~ baseline + treatment, data = BPRSL8S2)
##
## Residuals:
## Min 1Q Median 3Q Max
## -16.1729 -4.5994 0.1088 4.6703 21.0656
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 13.07897 4.55710 2.870 0.00675 **
## baseline 0.49127 0.08943 5.493 3.05e-06 ***
## treatment2 -0.58879 2.49584 -0.236 0.81480
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 7.872 on 37 degrees of freedom
## Multiple R-squared: 0.4494, Adjusted R-squared: 0.4196
## F-statistic: 15.1 on 2 and 37 DF, p-value: 1.605e-05
RATS <- read.table("https://raw.githubusercontent.com/KimmoVehkalahti/MABS/master/Examples/data/rats.txt", header = TRUE, sep = '\t')
RATS$ID <- factor(RATS$ID)
RATS$Group <- factor(RATS$Group)
RATSL <- pivot_longer(RATS, cols=-c(ID,Group), names_to = "WD",values_to = "Weight") %>% mutate(Time = as.integer(substr(WD,3,4))) %>% arrange(Time)
# Plot the RATSL data
library(ggplot2)
ggplot(RATSL, aes(x = Time, y = Weight, group = ID)) +
geom_line()
# access library lme4
library(lme4)
## Loading required package: Matrix
##
## Attaching package: 'Matrix'
## The following objects are masked from 'package:tidyr':
##
## expand, pack, unpack
# Create a random intercept model
RATS_ref <- lmer(Weight ~ Time + Group + (1 | ID), data = RATSL, REML = FALSE)
summary(RATS_ref)
## Linear mixed model fit by maximum likelihood ['lmerMod']
## Formula: Weight ~ Time + Group + (1 | ID)
## Data: RATSL
##
## AIC BIC logLik deviance df.resid
## 1333.2 1352.2 -660.6 1321.2 170
##
## Scaled residuals:
## Min 1Q Median 3Q Max
## -3.5386 -0.5581 -0.0494 0.5693 3.0990
##
## Random effects:
## Groups Name Variance Std.Dev.
## ID (Intercept) 1085.92 32.953
## Residual 66.44 8.151
## Number of obs: 176, groups: ID, 16
##
## Fixed effects:
## Estimate Std. Error t value
## (Intercept) 244.06890 11.73107 20.80
## Time 0.58568 0.03158 18.54
## Group2 220.98864 20.23577 10.92
## Group3 262.07955 20.23577 12.95
##
## Correlation of Fixed Effects:
## (Intr) Time Group2
## Time -0.090
## Group2 -0.575 0.000
## Group3 -0.575 0.000 0.333
RATS_ref1 <- lmer(Weight ~ Time + Group + (Time | ID), data = RATSL, REML = FALSE)
## Warning in checkConv(attr(opt, "derivs"), opt$par, ctrl = control$checkConv, :
## Model failed to converge with max|grad| = 0.0028726 (tol = 0.002, component 1)
summary(RATS_ref1)
## Linear mixed model fit by maximum likelihood ['lmerMod']
## Formula: Weight ~ Time + Group + (Time | ID)
## Data: RATSL
##
## AIC BIC logLik deviance df.resid
## 1194.2 1219.6 -589.1 1178.2 168
##
## Scaled residuals:
## Min 1Q Median 3Q Max
## -3.2259 -0.4322 0.0555 0.5637 2.8825
##
## Random effects:
## Groups Name Variance Std.Dev. Corr
## ID (Intercept) 1139.2004 33.7520
## Time 0.1122 0.3349 -0.22
## Residual 19.7479 4.4439
## Number of obs: 176, groups: ID, 16
##
## Fixed effects:
## Estimate Std. Error t value
## (Intercept) 246.46821 11.80888 20.871
## Time 0.58568 0.08548 6.852
## Group2 214.55803 20.16986 10.638
## Group3 258.91288 20.16986 12.837
##
## Correlation of Fixed Effects:
## (Intr) Time Group2
## Time -0.166
## Group2 -0.569 0.000
## Group3 -0.569 0.000 0.333
## optimizer (nloptwrap) convergence code: 0 (OK)
## Model failed to converge with max|grad| = 0.0028726 (tol = 0.002, component 1)
# perform an ANOVA test on the two models
anova(RATS_ref1, RATS_ref)
## Data: RATSL
## Models:
## RATS_ref: Weight ~ Time + Group + (1 | ID)
## RATS_ref1: Weight ~ Time + Group + (Time | ID)
## npar AIC BIC logLik deviance Chisq Df Pr(>Chisq)
## RATS_ref 6 1333.2 1352.2 -660.58 1321.2
## RATS_ref1 8 1194.2 1219.6 -589.11 1178.2 142.94 2 < 2.2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
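The chi-square value in the ANOVA table is a likelihood-ratio statistic: twice the difference of the two log-likelihoods, compared against a chi-square distribution with 2 degrees of freedom (the two extra random-effect parameters). Recomputed from the numbers above:

```r
ll0 <- -660.58  # logLik of the random-intercept model
ll1 <- -589.11  # logLik of the random intercept + slope model
chisq <- 2 * (ll1 - ll0)                   # 142.94, as in the table
pchisq(chisq, df = 2, lower.tail = FALSE)  # vanishingly small p-value
```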
ggplot(RATSL, aes(x = Time, y = Weight, group = ID)) +
geom_line(aes(linetype = Group)) +
scale_x_continuous(name = "Time (days)", breaks = seq(0, 60, 20)) +
scale_y_continuous(name = "Observed weight (grams)") +
theme(legend.position = "top")
(more chapters to be added similarly as we proceed with the course!)